DeepJoin: Joinable Table Discovery with Pre-trained Language Models
Due to the usefulness in data enrichment for data analysis tasks, joinable
table discovery has become an important operation in data lake management.
Existing approaches target equi-joins, the most common way of combining tables
for creating a unified view, or semantic joins, which tolerate misspellings and
different formats to deliver more join results. They are either exact solutions
whose running time is linear in the sizes of query column and target table
repository or approximate solutions lacking precision. In this paper, we
propose Deepjoin, a deep learning model for accurate and efficient joinable
table discovery. Our solution is an embedding-based retrieval, which employs a
pre-trained language model (PLM) and is designed as one framework serving both
equi- and semantic joins. We propose a set of contextualization options to
transform column contents to a text sequence. The PLM reads the sequence and is
fine-tuned to embed columns to vectors such that columns are expected to be
joinable if they are close to each other in the vector space. Since the output
of the PLM is fixed in length, the subsequent search procedure becomes
independent of the column size. With a state-of-the-art approximate nearest
neighbor search algorithm, the search time is logarithmic in the repository
size. To train the model, we devise the techniques for preparing training data
as well as data augmentation. The experiments on real datasets demonstrate that
by training on a small subset of a corpus, Deepjoin generalizes to large
datasets and its precision consistently outperforms other approximate
solutions'. Deepjoin is even more accurate than an exact solution to semantic
joins when evaluated with labels from experts. Moreover, when equipped with a
GPU, Deepjoin is up to two orders of magnitude faster than existing solutions.
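The embedding-based retrieval described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `column_to_text` shows one hypothetical contextualization option, `toy_embed` is a hashed character-trigram stand-in for the fine-tuned PLM, and `top_k` is a brute-force scan standing in for the approximate nearest neighbor index. All names and the width `DIM` are assumptions made for illustration.

```python
import math
import zlib

DIM = 64  # assumed embedding width; with a real PLM the model fixes this


def column_to_text(name, values):
    """Turn a column into one text sequence (a hypothetical contextualization)."""
    distinct = sorted(set(values))
    return f"{name} contains {len(distinct)} values: " + ", ".join(distinct)


def toy_embed(text):
    """Stand-in for the PLM: hashed character trigrams, L2-normalized."""
    vec = [0.0] * DIM
    lowered = text.lower()
    for i in range(len(lowered) - 2):
        bucket = zlib.crc32(lowered[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity of two already-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, repo, k):
    """Brute-force ranking; an ANN index would replace this linear scan."""
    ranked = sorted(repo.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

Because every column maps to a vector of length `DIM`, the cost of comparing two columns in this sketch does not depend on how many cells they contain, which mirrors the fixed-length output property the abstract relies on.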
Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach
Finding joinable tables in data lakes is a key procedure in many applications
such as data integration, data augmentation, data analysis, and data market.
Traditional approaches that find equi-joinable tables are unable to deal with
misspellings and different formats, nor do they capture any semantic joins. In
this paper, we propose PEXESO, a framework for joinable table discovery in data
lakes. We embed textual values as high-dimensional vectors and join columns
under similarity predicates on high-dimensional vectors, thereby addressing the
limitations of equi-join approaches and identifying more meaningful results. To
efficiently find joinable tables with similarity, we propose a block-and-verify
method that utilizes pivot-based filtering. A partitioning technique is
developed to cope with the case when the data lake is large and the index
cannot fit in main memory. An experimental evaluation on real datasets shows
that our solution identifies substantially more tables than equi-joins and
outperforms other similarity-based options, and the join results are useful in
data enrichment for machine learning tasks. The experiments also demonstrate
the efficiency of the proposed method. Comment: Full version of paper in ICDE 202
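As a rough illustration of pivot-based filtering, the sketch below prunes candidates with the triangle inequality before verifying the survivors with an exact distance. It is a simplified range search under stated assumptions, not the block-and-verify method of the paper: the function names, the Euclidean metric, and the threshold `tau` are all chosen here for illustration.

```python
import math


def dist(u, v):
    """Euclidean distance (an assumed metric for this sketch)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_pivot_table(vectors, pivots):
    """Precompute the distance from every vector to every pivot."""
    return [[dist(x, p) for p in pivots] for x in vectors]


def range_search(query, vectors, pivots, table, tau):
    """Return indices within tau of the query, plus how many were verified."""
    q_to_pivots = [dist(query, p) for p in pivots]
    matches, verified = [], 0
    for idx, x in enumerate(vectors):
        # Triangle inequality: |d(q,p) - d(x,p)| <= d(q,x),
        # so a gap larger than tau proves x cannot match.
        if any(abs(qp - xp) > tau for qp, xp in zip(q_to_pivots, table[idx])):
            continue
        verified += 1
        if dist(query, x) <= tau:
            matches.append(idx)
    return matches, verified
```

The filter only needs the precomputed pivot distances, so the exact distance computation is paid only for candidates that survive pruning.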
Carrier doping to a partially disordered state in the periodic Anderson model on a triangular lattice
We investigate the effect of hole and electron doping to half-filling in the
periodic Anderson model on a triangular lattice by the Hartree-Fock
approximation at zero temperature. At half-filling, the system exhibits a
partially disordered insulating state, in which a collinear antiferromagnetic
order on an unfrustrated honeycomb subnetwork coexists with a nonmagnetic state
at the remaining sites. We find that the carrier doping destabilizes the
partially disordered state, resulting in a phase separation to a doped metallic
state with different magnetic order. The partially disordered state is
restricted to the half-filled insulating case, while its metallic counterpart
is obtained as a metastable state in a narrow electron doped region. Comment: 4 pages, 2 figures
The whole blood transcriptional regulation landscape in 465 COVID-19 infected samples from Japan COVID-19 Task Force
Japan COVID-19 Task Force: Comprehensive analysis of gene expression in blood cells from COVID-19 patients -- Individual differences in the human genome sequence influence severity-dependent changes in gene expression --. Kyoto University press release. 2022-08-23. Coronavirus disease 2019 (COVID-19) is a recently emerged infectious disease that has caused millions of deaths, and a comprehensive understanding of its disease mechanisms has yet to be established. In particular, studies of gene expression dynamics and the regulation landscape in COVID-19 infected individuals are limited. Here, we report on a thorough analysis of whole blood RNA-seq data from 465 genotyped samples from the Japan COVID-19 Task Force, including 359 severe and 106 non-severe COVID-19 cases. We discover 1169 putative causal expression quantitative trait loci (eQTLs) including 34 possible colocalizations with biobank fine-mapping results of hematopoietic traits in a Japanese population, 1549 putative causal splice QTLs (sQTLs; e.g., two independent sQTLs at TOR1AIP1), as well as biologically interpretable trans-eQTL examples (e.g., REST and STING1), all fine-mapped at single variant resolution. We perform differential gene expression analysis to elucidate 198 genes with increased expression in severe COVID-19 cases and enriched for innate immune-related functions. Finally, we evaluate the limited but non-zero effect of COVID-19 phenotype on eQTL discovery, and highlight the presence of COVID-19 severity-interaction eQTLs (ieQTLs; e.g., CLEC4C and MYBL2). Our study provides a comprehensive catalog of whole blood regulatory variants in the Japanese population, as well as a reference for transcriptional landscapes in response to COVID-19 infection.
DOCK2 is involved in the host genetics and biology of severe COVID-19
Japan COVID-19 Task Force: Elucidating the mechanism by which the COVID-19 susceptibility gene DOCK2 contributes to severe disease -- A therapeutic target for COVID-19 discovered using Asia's largest biorepository --. Kyoto University press release. 2022-08-10. Identifying the host genetic factors underlying severe COVID-19 is an emerging challenge. Here we conducted a genome-wide association study (GWAS) involving 2,393 cases of COVID-19 in a cohort of Japanese individuals collected during the initial waves of the pandemic, with 3,289 unaffected controls. We identified a variant on chromosome 5 at 5q35 (rs60200309-A), close to the dedicator of cytokinesis 2 gene (DOCK2), which was associated with severe COVID-19 in patients less than 65 years of age. This risk allele was prevalent in East Asian individuals but rare in Europeans, highlighting the value of genome-wide association studies in non-European populations. RNA-sequencing analysis of 473 bulk peripheral blood samples identified decreased expression of DOCK2 associated with the risk allele in these younger patients. DOCK2 expression was suppressed in patients with severe cases of COVID-19. Single-cell RNA-sequencing analysis (n = 61 individuals) identified cell-type-specific downregulation of DOCK2 and a COVID-19-specific decreasing effect of the risk allele on DOCK2 expression in non-classical monocytes. Immunohistochemistry of lung specimens from patients with severe COVID-19 pneumonia showed suppressed DOCK2 expression. Moreover, inhibition of DOCK2 function with CPYPP increased the severity of pneumonia in a Syrian hamster model of SARS-CoV-2 infection, characterized by weight loss, lung oedema, enhanced viral loads, impaired macrophage recruitment and dysregulated type I interferon responses. We conclude that DOCK2 has an important role in the host immune response to SARS-CoV-2 infection and the development of severe COVID-19, and could be further explored as a potential biomarker and/or therapeutic target.
User Identity Linkage for Different Behavioral Patterns across Domains
As customers use and benefit from multiple services, a large amount of customer data is accumulating daily. Connecting a customer's identity on a service with her identity on a different service, known as user identity linkage (UIL), enables a comprehensive understanding of users in a variety of real-world applications. The difficulties of UIL tasks in marketing applications are mainly the lack of user demographics and diverse user behavioral patterns, which differ from those of UIL tasks in social networking services that previous UIL methods have mainly been used to tackle. In this paper, we propose a novel method for UIL for different behavioral patterns to determine whether two given behavioral histories come from the same user without using any user demographics. Our proposed method links users by using natural language processing to efficiently characterize user intrinsic features and by bridging the gap between two different behavioral patterns of the same user. We conducted experiments to evaluate our proposed method on three real-world open source datasets and observed that it linked users more successfully than conventional UIL methods.
Learning with Unsure Responses
Many annotation systems allow adding an unsure option to the labels, because the annotators have different expertise, and they may not have enough confidence to choose a label for some assigned instances. However, all the existing approaches only learn from the labels with a clear class name and ignore the unsure responses. Because the unsure responses also account for a proportion of the dataset (e.g., about 10-30% in real datasets), existing approaches lead to high costs such as paying more money or taking more time to collect enough labeled data. Therefore, it is a significant issue to make use of these unsure responses. In this paper, we make the unsure responses contribute to training classifiers. We found a property that the instances corresponding to the unsure responses always appear close to the decision boundary of classification. We design a loss function called unsure loss based on this property. We extend the conventional methods for classification and learning from crowds with this unsure loss. Experimental results on real-world and synthetic data demonstrate the performance of our method and its superiority over baseline methods.
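The abstract does not give the form of the unsure loss, so the following is only a hypothetical sketch of one loss with the stated property, namely that it is smallest when an instance sits on the decision boundary. It assumes a binary classifier with a real-valued score and uses cross-entropy against a 0.5 target for unsure responses; the function names and the `weight` parameter are assumptions, not the paper's definitions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def labeled_loss(score, label):
    """Logistic loss for a confident label in {0, 1}."""
    p = sigmoid(score)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def unsure_loss(score):
    """Hypothetical unsure loss: cross-entropy against a 0.5 target.

    It is minimized at score == 0, i.e. on the decision boundary.
    """
    p = sigmoid(score)
    return -0.5 * math.log(p) - 0.5 * math.log(1.0 - p)


def total_loss(scored_examples, weight=1.0):
    """Average loss over (score, response) pairs; response None means unsure."""
    total = 0.0
    for score, response in scored_examples:
        if response is None:
            total += weight * unsure_loss(score)
        else:
            total += labeled_loss(score, response)
    return total / len(scored_examples)
```

Under this assumed form, an unsure instance scored far from zero is penalized in either direction, so training pulls such instances toward the boundary instead of discarding them.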
Meimei: An Efficient Probabilistic Approach for Semantically Annotating Tables
Given a large amount of table data, how can we find the tables that contain the contents we want? A naive search fails when the column names are ambiguous, such as if columns containing stock price information are named “Close” in one table and named “P” in another table. One way of dealing with this problem that has been gaining attention is the semantic annotation of table data columns by using canonical knowledge. While previous studies successfully dealt with this problem for specific types of table data such as web tables, it still remains for various other types of table data: (1) most approaches do not handle table data with numerical values, and (2) their predictive performance is not satisfactory. This paper presents a novel approach for table data annotation that combines a latent probabilistic model with multi-label classifiers. It features three advantages over previous approaches due to using highly predictive multi-label classifiers in the probabilistic computation of semantic annotation. (1) It is more versatile due to using multi-label classifiers in the probabilistic model, which enables various types of data such as numerical values to be supported. (2) It is more accurate due to the multi-label classifiers and probabilistic model working together to improve predictive performance. (3) It is more efficient due to potential functions based on multi-label classifiers reducing the computational cost for annotation. Extensive experiments demonstrated the superiority of the proposed approach over state-of-the-art approaches for semantic annotation of real data (183 human-annotated tables obtained from the UCI Machine Learning Repository).